nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL #2118
rwgk
left a comment
I used Cursor GPT-5.5 1M Extra High
GPT Findings
- Medium:
cuda_bindings/docs/source/release/13.2.0-notes.rst and
cuda_bindings/docs/source/release/13.3.0-notes.rst give an incomplete
raw-NVML workaround. The PR and nvbug context indicate the corruption happens
when nvmlDeviceGetComputeRunningProcesses_v3 first primes the PID cache, so
setting the locale to "C" only before nvml.system_get_process_name can be
too late if the cache was already populated. The notes should say to prime and
read under "C", matching the cuda.core workaround.
- Medium:
cuda_core/cuda/core/system/_system.pyx now makes
get_process_name() depend on successfully enumerating compute processes on
every device. Any unrelated device-level NVML failure now breaks a per-PID
lookup, including on non-WSL where this is a behavior change. Consider trying
the direct read first on non-WSL, or making priming failures narrower.
- Low:
cuda_core/docs/source/release/1.1.0-notes.rst says the WSL workaround
may hold a global lock, but the implementation uses POSIX per-thread locale
APIs and no global lock is present. That note looks stale or misleading.
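For users of the raw NVML bindings, the "prime and read under "C"" advice from the first finding could look roughly like the sketch below. It is an illustration only, not the cuda.core implementation: it uses Python's locale.setlocale, which changes the locale for the whole process, and the names c_locale_for_process, read_name_under_c_locale, prime and read are hypothetical. In real use, prime would wrap nvml.device_get_compute_running_processes_v3 and read would wrap nvml.system_get_process_name.

```python
import contextlib
import locale
import threading

# Hypothetical lock: locale.setlocale is process-global, so serialize the switch.
_LOCALE_LOCK = threading.Lock()


@contextlib.contextmanager
def c_locale_for_process():
    """Temporarily switch the whole process to the "C" locale."""
    with _LOCALE_LOCK:
        saved = locale.setlocale(locale.LC_ALL)  # query only, no change
        locale.setlocale(locale.LC_ALL, "C")
        try:
            yield
        finally:
            locale.setlocale(locale.LC_ALL, saved)


def read_name_under_c_locale(prime, read, pid):
    """Run the cache-priming call and the name lookup under one "C" scope.

    prime and read are caller-supplied callables (assumptions of this sketch).
    """
    with c_locale_for_process():
        prime()
        return read(pid)
```

Because the switch is process-wide, other threads that format numbers or dates will briefly see the "C" locale too; that is the trade-off of this approach compared with per-thread locale APIs.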
Lightweight Thread-Safety Recommendation
The new locale switching implementation is reasonably thread-safe because it
uses POSIX newlocale/uselocale/freelocale, which scope the "C" locale to
the calling OS thread rather than mutating process-global locale state.
The remaining race is around NVML's process-name cache, which appears to be
process/global driver state. On WSL, another thread could call a cache-priming
path such as nvmlDeviceGetComputeRunningProcesses_v3 under a non-"C" locale
between get_process_name()'s prime and read steps, reintroducing corrupted
cached data.
A lightweight improvement would be to add a module-private Python
threading.RLock shared by the cuda.core paths most likely to touch this
cache. Hold it around:
- system.get_process_name()'s WSL c_locale_guard() + prime + read sequence.
- Device.compute_running_processes on WSL, since that is the main in-package
path that can prime the process-name cache.
This would not protect raw cuda.bindings.nvml.* calls or external users, but
it would cover the most likely cuda.core race without globally serializing all
NVML access. A Python RLock is sufficient here: even if Cython releases the GIL
inside NVML wrappers, the lock remains held until the surrounding Python with
block exits.
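A minimal sketch of that recommendation follows, assuming a shared module-private lock. The names _process_name_lock, locked_get_process_name and locked_compute_running_processes are hypothetical, and prime, read, enumerate_processes and guard are caller-supplied stand-ins for the real NVML calls and locale guard rather than cuda.core APIs.

```python
import contextlib
import threading

# Hypothetical module-private lock shared by the paths that touch the cache.
_process_name_lock = threading.RLock()


def locked_get_process_name(pid, prime, read, guard=contextlib.nullcontext):
    """Hold the lock across the whole guard + prime + read sequence."""
    with _process_name_lock:
        with guard():
            prime()
            return read(pid)


def locked_compute_running_processes(enumerate_processes):
    """Take the same lock so this path cannot run between prime and read."""
    with _process_name_lock:
        return enumerate_processes()
```

An RLock rather than a plain Lock lets the priming step call the enumeration path on the same thread without deadlocking.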
* Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling ``pip uninstall -y cuda-python`` followed by installing ``pip install cuda-python``.
* ``nvml.system_get_process_name`` on WSL can return incorrect values. To work around this, set the locale to "C" before calling ``nvml.device_get_compute_running_processes_v3`` (which sets the process names) and before calling ``nvml.system_get_process_name``. ``cuda_core`` does this automatically, but users of the raw NVML API will need to do this manually.
Why is this in both 13.2 and 13.3 release notes?
It's a compatibility note that's in line with the existing v12.6.2.post1 and below note, which we've been carrying forward since 12.8.0, I believe in every single release since then:
$ git grep 'v12.6.2.post1 and below' | cutniq | cut -d/ -f-4 | uniq -c
16 cuda_bindings/docs/source/release
15 cuda_python/docs/source/release
I was working on the assumption that we put "known issues" on previous releases. (Since the issue applies to all of them).
My 2c: We very intentionally don't do any magic in |
That's the conclusion I'm coming to, too. I discussed this rationale with GPT-5.5:
Additional idea, to be helpful to our users: Separately from the cuda_core workaround, I still think cuda_bindings would benefit from routing C-string decoding through a tiny helper, to make failures actionable. The current UnicodeDecodeError tells us only the first bad byte and position; in this case the useful information was the raw buffer contents and which NVML call produced them. A helper could keep the success path identical, but on decode failure raise an exception/message that includes the source API name plus a bounded repr/hex dump of the bytes. That would have made the nvbug much easier to diagnose and should be generally useful for future driver/library string issues. I would avoid special system_get_process_name behavior; it's too niche.
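One possible shape for such a helper is sketched below. It is not existing cuda_bindings code: the function name decode_c_string, its parameters and the 64-byte dump limit are all assumptions made for illustration.

```python
def decode_c_string(raw: bytes, api_name: str, max_dump: int = 64) -> str:
    """Decode a C string as UTF-8; on failure, report the API and raw bytes.

    Hypothetical helper: the success path is a plain UTF-8 decode.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        shown = raw[:max_dump]
        suffix = "" if len(raw) <= max_dump else f" ... ({len(raw)} bytes total)"
        reason = (
            f"{exc.reason}; source API: {api_name}; "
            f"raw bytes (hex): {shown.hex(' ')}{suffix}"
        )
        # Re-raise the same exception type so existing handlers keep working.
        raise UnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end, reason
        ) from exc
```

Raising UnicodeDecodeError again, rather than a new exception type, keeps the change invisible to callers that already catch it, while the message gains the API name and a bounded hex dump.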
I cancelled the CI manually after seeing that Python 3.12 with 12.9 is hanging again (two jobs). I'll report the details on the tracking bug we have already. I triggered a rerun.
The hanging jobs from the previous attempt are now tracked here:
It looks like the jobs tracked under #2004 (comment) are hanging again. Maybe the driver update made the issue a lot worse than before? Each time this happens, two runners are blocked for 4 hours, and of course merging the PR is blocked, too. Interestingly, I just see this: But after that it's still hanging.
Motivated by the experience here, I created #2122 ([ENH]: Make cuda_bindings UnicodeDecodeError more actionable)
The two jobs were hanging again in the 4th attempt. I cancelled them to unblock the runners. In the meantime @aryanputta sent PR #2121, I just triggered the CI there. I'll come back here to try again after PR #2121 is merged.
Summary
cuda.core.system.get_process_name(pid) raises UnicodeDecodeError under WSL whenever the calling process has a non-C locale (which is the default state for any CPython process, since the interpreter calls setlocale(LC_ALL, "") at startup). This is reproducible by running the cuda_core test suite with any seed that schedules tests/system/test_system_device.py::test_compute_running_processes before tests/system/test_system_system.py::test_get_process_name.
The underlying defect is in NVML's WSL implementation (see Root cause below). This PR adds a scoped, defensive workaround in cuda_core so the public API returns a correct value on WSL. It also fixes a latent issue where get_process_name was effectively unusable from a fresh process because it never primed NVML's per-PID name cache.
Root cause: the WSL mojibake
NVML's nvmlSystemGetProcessName on WSL takes a different code path depending on the process's current locale. With the default "C" locale, the function returns the basename portion of /proc/<pid>/exe correctly. With any other locale (including the typical en_US.UTF-8), it instead walks an internal UTF-16LE buffer holding the executable path but uses a 4-byte stride (as if the buffer were UTF-32LE). Each "code point" it pulls is therefore two adjacent ASCII bytes packed into the low and high bytes of a single 24-bit value. That value is then emitted as an extended 5-byte UTF-8 sequence (the 0xF8-prefixed encoding used to represent code points beyond U+10FFFF).
The net result for, say, a Python process whose /proc/<pid>/exe resolves to:
is that the returned buffer looks like ~180 bytes of f8 … chunks followed by the correctly-encoded trailing basename, e.g.:
Decoding the first chunk illustrates the pattern:
- f8 9a 80 80 af decode as the extended-UTF-8 code point 0x68002F
- that value has 'h' (0x68) in the high byte and '/' (0x2F) in the low byte, i.e. the source ASCII bytes /h read as a little-endian 32-bit value padded with zeros
Every chunk has this structure; together they spell out the prefix /home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/ two characters at a time. The trailing /python3.14 is unaffected because of where the buggy stride leaves the cursor.
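The arithmetic above can be checked with a short script. This is a model of the described behaviour, not NVML's code: decode_5byte, encode_5byte and simulate_mojibake are hypothetical names, and the simulation only reproduces the "UTF-16LE read with a 4-byte stride, emitted as 5-byte UTF-8" pattern for the prefix.

```python
import struct


def encode_5byte(value: int) -> bytes:
    """Encode a value below 2**26 in the obsolete 5-byte UTF-8 form."""
    if not 0 <= value < (1 << 26):
        raise ValueError("value does not fit in a 5-byte sequence")
    return bytes([
        0xF8 | (value >> 24),
        0x80 | ((value >> 18) & 0x3F),
        0x80 | ((value >> 12) & 0x3F),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def decode_5byte(chunk: bytes) -> int:
    """Decode one 0xF8-prefixed 5-byte sequence back to its integer value."""
    if len(chunk) != 5 or chunk[0] & 0xFC != 0xF8:
        raise ValueError("not a 5-byte extended UTF-8 sequence")
    value = chunk[0] & 0x03
    for byte in chunk[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def simulate_mojibake(prefix: str) -> bytes:
    """Read UTF-16LE text 4 bytes at a time and emit 5-byte sequences.

    The prefix must have an even number of characters for this model.
    """
    utf16 = prefix.encode("utf-16-le")
    out = bytearray()
    for (value,) in struct.iter_unpack("<I", utf16):
        out += encode_5byte(value)
    return bytes(out)


first_chunk = bytes.fromhex("f89a8080af")
print(hex(decode_5byte(first_chunk)))   # 0x68002f
print(simulate_mojibake("/h").hex(" "))  # f8 9a 80 80 af
```

Feeding the 72-character prefix quoted above through simulate_mojibake yields 36 chunks of 5 bytes, which matches the ~180 bytes of f8-prefixed data described earlier.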
Why the workaround needs to "re-prime"
nvmlSystemGetProcessName is cache-driven: the per-PID name is populated the first time NVML enumerates compute processes that include the PID (typically via nvmlDeviceGetComputeRunningProcesses_v3). Critically:
- Once the cached entry is corrupted, switching to the "C" locale and re-reading does not unscramble it; the cache survives the locale flip.
- Re-priming under the "C" locale overwrites the cached entry with the correct UTF-8 string. Subsequent reads (in any locale) then return correctly.
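The two points above can be illustrated with a toy model. ToyProcessNameCache is a hypothetical stand-in written only to mirror the behaviour described here; it makes no claim about how NVML implements its cache.

```python
class ToyProcessNameCache:
    """Toy model: the locale matters when an entry is written, not when read."""

    def __init__(self):
        self._names = {}

    def prime(self, pid, exe_name, locale_name):
        # Model assumption: priming under a non-"C" locale stores garbage.
        if locale_name == "C":
            self._names[pid] = exe_name
        else:
            self._names[pid] = "<mojibake>"

    def read(self, pid, locale_name):
        # Model assumption: a read returns the cached entry as-is.
        return self._names[pid]


cache = ToyProcessNameCache()
cache.prime(1234, "python3.14", "en_US.UTF-8")
stale = cache.read(1234, "C")            # flipping the locale does not help
cache.prime(1234, "python3.14", "C")     # re-priming overwrites the entry
fixed = cache.read(1234, "en_US.UTF-8")  # later reads are fine in any locale
```

In this model, stale is still the corrupted entry while fixed is the correct name, which is the behaviour the bullets describe.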
So the workaround must do prime + read together under "C".
Behaviour after this PR
- get_process_name now primes the NVML cache automatically. This makes it usable from a fresh process (previously a caller had to have manually queried device.compute_running_processes first or accept NotFoundError).
- On WSL, the prime and the read both run under the "C" locale, so the returned name is the correct UTF-8 string regardless of the caller's locale.
Discussion
cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?
generally be avoided.